Examining racial discrimination in the US job market

Background

Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning black-sounding or white-sounding names to otherwise identical résumés and observing the impact on requests for interviews from employers.

Data

In the dataset provided, each row represents a resume. The 'race' column has two values, 'b' and 'w', indicating a black-sounding or white-sounding name. The 'call' column has two values, 1 and 0, indicating whether or not the resume received a callback from an employer.

Note that the 'b' and 'w' values in race are assigned randomly to the resumes.

Exercise

You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.

Answer the following questions below in this notebook and submit it to your GitHub account.

  1. What test is appropriate for this problem? Does CLT apply?
  2. What are the null and alternate hypotheses?
  3. Compute margin of error, confidence interval, and p-value.
  4. Discuss statistical significance.

You can include written notes in notebook cells using Markdown.


In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
from IPython.core.display import HTML
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))


Out[2]:

In [20]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')
data.head()


Out[20]:
id ad education ofjobs yearsexp honors volunteer military empholes occupspecific ... compreq orgreq manuf transcom bankreal trade busservice othservice missind ownership
0 b 1 4 2 6 0 0 0 1 17 ... 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1 b 1 3 3 6 0 1 1 0 316 ... 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
2 b 1 4 1 6 0 0 0 0 19 ... 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3 b 1 3 4 6 0 1 0 1 313 ... 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
4 b 1 3 3 22 0 0 0 0 313 ... 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 Nonprofit

5 rows × 65 columns


In [4]:
# number of callbacks for black-sounding names
sum(data[data.race=='b'].call)


Out[4]:
157.0

In [5]:
#number of callbacks for white-sounding names
sum(data[data.race=='w'].call)


Out[5]:
235.0

In [6]:
black = data[data.race=='b']
white = data[data.race=='w']

In [18]:
len(black)


Out[18]:
2435

In [19]:
len(white)


Out[19]:
2435

In [8]:
black_called = len(black[black['call']==True])
black_notcalled = len(black[black['call']== False])
white_called = len(white[white['call']==True])
white_notcalled = len(white[white['call']== False])

In [16]:
#probability of a white-sounding name getting a callback
prob_white_called = white_called/len(white)
prob_white_called


Out[16]:
0.09650924024640657

In [17]:
#probability of a black-sounding name getting a callback
prob_black_called = black_called/len(black)
prob_black_called


Out[17]:
0.06447638603696099

In [35]:
#probability of a resume getting a callback
prob_called = sum(data.call)/len(data)
prob_called


Out[35]:
0.080492813141683772

In [15]:
results = pd.DataFrame({'black':{'called':black_called,'not_called':black_notcalled},
                       'white':{'called':white_called,'not_called':white_notcalled}})
results


Out[15]:
black white
called 157 235
not_called 2278 2200
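
As an aside, pandas can build the same contingency table directly from the raw data with crosstab; a minimal equivalent (a sketch, not part of the original run):

In [ ]:
#equivalent contingency table straight from the raw data
pd.crosstab(data.call, data.race)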

Does the central limit theorem apply? What statistical test is appropriate?

The CLT applies (in its most common form) to a sufficiently large set of independent, identically distributed random variables. Each of these data points is independent and drawn from the same probability distribution (we assume). If the resumes were sent out to a representative collection of potential employers, then this shouldn't be an issue.

We also see from our results table that for both groups $np > 10$ and $n(1-p) > 10$.

We note, however, that unlike the previous project, here we have a binary outcome for each of two groups, making four categories in total. Thus, an appropriate statistical test is Pearson's $\chi^{2}$ test, a statistical test that applies to sets of categorical data like we have here. The CLT tells us that for large sample sizes the distribution of category counts will tend toward a multivariate normal distribution.

The alternative would be to directly compare the callback rates of the two groups with a two-tailed z-test for the difference of two proportions. Both of these methods were covered in the Khan Academy links provided by Springboard. The conditions above can be checked in code, as shown below.
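
As a quick sanity check, the $np > 10$ and $n(1-p) > 10$ conditions can be computed directly from the dataframes already defined (a sketch, not part of the original run):

In [ ]:
#verify the normal-approximation conditions for each group
for name, grp in [('black', black), ('white', white)]:
    n, p = len(grp), grp.call.mean()
    print(name, n * p, n * (1 - p))  #both counts should exceed 10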

What are the null and alternate hypotheses?

The null hypothesis is that the race a name sounds like has no effect on the rate of callbacks. The alternate hypothesis is that there is a statistically significant difference between the two groups. Since the number of black- and white-sounding names was the same, and the names were randomly assigned to identical resumes (thereby removing the potential real-world biases of different education, experience levels, or other advantages/disadvantages), we would expect to see the same number of total callbacks under the null hypothesis. Clearly this is not the case: white-sounding names have a $9.65\%$ callback rate, in contrast to $6.45\%$ for black-sounding names.
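
Writing $p_w$ and $p_b$ for the callback probabilities of white- and black-sounding names, this can be stated as

$$H_0: p_w = p_b \qquad\qquad H_1: p_w \neq p_b$$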

Compute margin of error, confidence interval, and p-value


In [48]:
#get expected counts for each group under the null hypothesis (group sizes are equal)
total_called = sum(data.call)
total_notcalled = len(data) - sum(data.call)
total_called/2


Out[48]:
196.0

In [49]:
#Use a chi-square test with 1 degree of freedom, since (#cols-1)*(#rows-1) = 1.
#stats.chisquare defaults to k-1 = 3 dof for 4 categories, so ddof=2 reduces this to 1.
result_freq = [black_called, white_called, black_notcalled, white_notcalled]
expected_freq = [total_called/2,total_called/2,total_notcalled/2,total_notcalled/2]

stats.chisquare(f_obs=result_freq, f_exp = expected_freq, ddof=2)


Out[49]:
Power_divergenceResult(statistic=16.879050414270221, pvalue=3.983886837585076e-05)

We obtain $\chi^2 = 16.9$ with a p-value of $3.98\times10^{-5}$. This is highly significant, well below standard thresholds. We can conclude that the perceived race of a name very likely plays a role in the rate of callbacks.
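
For reference, scipy can run the same test directly on the 2x2 table built earlier; with the Yates continuity correction disabled this should reproduce the statistic above (a sketch, not part of the original run):

In [ ]:
#Pearson chi-square straight from the contingency table
chi2, p, dof, expected = stats.chi2_contingency(results.values, correction=False)
print(chi2, p, dof)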


In [67]:
#calculate standard error, using pooled data. 
stderr = np.sqrt(prob_called*(1-prob_called)*(1/len(black)+1/len(white)))
print(stderr)
#get z score
z_score = (prob_white_called - prob_black_called) / stderr
z_score


0.00779689403617
Out[67]:
4.1084121524343464

In [63]:
#We can also compute a p-value for the difference in proportions directly,
#apart from the one obtained from our chi-squared test. The two tests are
#equivalent for a 2x2 table, so this gives the same result.
#Use norm.sf since z is positive (numerically more accurate than 1 - norm.cdf).
p_value2 = stats.norm.sf(z_score)*2
p_value2


Out[63]:
3.9838868375850767e-05
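
If statsmodels is available (it is not imported in the original notebook), the same pooled two-proportion z-test is a one-liner; a sketch:

In [ ]:
from statsmodels.stats.proportion import proportions_ztest

#callback counts and sample sizes for white- and black-sounding names
z, p = proportions_ztest(count=[white_called, black_called], nobs=[len(white), len(black)])
print(z, p)

Note also that $z^2 = (4.108)^2 \approx 16.88$, matching the $\chi^2$ statistic above and confirming the equivalence of the two tests for a 2x2 table.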

In [66]:
#For a 95 percent, two-tailed confidence interval, the critical z value is
z_critical = stats.norm.ppf(.975)
z_critical


Out[66]:
1.959963984540054

In [70]:
#unpooled standard error, the conventional choice for a confidence interval
std_err_unpooled = np.sqrt(prob_black_called*(1-prob_black_called)/len(black) +
                           prob_white_called*(1-prob_white_called)/len(white))
#note: the interval below uses the pooled stderr from above; the two are
#nearly identical here since the group sizes are equal
conf_interval = [prob_white_called - prob_black_called - z_critical*stderr,
                 prob_white_called - prob_black_called + z_critical*stderr]
conf_interval


Out[70]:
[0.016751222707276352, 0.047314485711614819]

Thus, at the 95% confidence level, we estimate that the true difference in callback rates between white- and black-sounding names lies within this range (roughly 1.7 to 4.7 percentage points).
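
The margin of error itself (half the width of this interval) was never printed explicitly, but it follows directly from the quantities above; with the values printed earlier it comes to roughly $1.96 \times 0.0078 \approx 0.0153$, i.e. about 1.5 percentage points:

In [ ]:
#margin of error for the 95% confidence interval
margin_of_error = z_critical * stderr
margin_of_error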

Discuss statistical significance

For our p-value test, we found that a result this extreme would arise from random chance fewer than 4 times in 100,000. Our confidence is thus very high that the observed effect is not due to random chance.

As for our confidence interval, we calculated the range by setting our error rate at 5%: in fewer than 5% of repeated samples would random chance push the estimated difference in proportions more than the margin of error (about 1.5 percentage points) away from the true difference. As a final check, we look at how the resumes split by sex, since sex could be a confounding factor:


In [23]:
len(white[white['sex']=='f'])


Out[23]:
1860

In [24]:
len(white[white['sex']=='m'])


Out[24]:
575

In [26]:
len(black[black['sex']=='f'])


Out[26]:
1886

In [25]:
len(black[black['sex']=='m'])


Out[25]:
549

In [32]:
print(sum(white[white['sex']=='f'].call)/len(white[white['sex']=='f']))
print(sum(white[white['sex']=='m'].call)/len(white[white['sex']=='m']))


0.0989247311828
0.0886956521739

In [33]:
print(sum(black[black['sex']=='f'].call)/len(black[black['sex']=='f']))
print(sum(black[black['sex']=='m'].call)/len(black[black['sex']=='m']))


0.0662778366914
0.0582877959927

We do note, however, that the split by sex is not even. While female resumes have a higher callback rate than male resumes in both groups, it is the black-sounding resumes that carry slightly more female names, so if anything this should work to shrink the observed difference between the races.

It is possible that some of the difference in callback rates is attributable to the sex of a name as well as its perceived race. Using the pool of names, one should check whether the white-name database contained more gender-neutral names; some of the statistical significance of the difference between the two groups might then be attributable to a different form of employer bias. Looking through the paper referenced at the start of the notebook, however, the first names do not appear gender-neutral. The possible exception is 'Brett' as a white masculine name, which can also be used as a feminine name but has fallen out of favour for that use in modern times.

So, with a bit of separate legwork, we are confident that the statistical significance of our result is not undermined by the sex split.
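
To make this concrete, the race gap can be recomputed within each sex from the rates printed above (a sketch, not part of the original run); both gaps come out close to the overall difference of about 3.2 percentage points:

In [ ]:
#callback-rate gap (white minus black) within each sex
for s in ['f', 'm']:
    gap = white[white.sex == s].call.mean() - black[black.sex == s].call.mean()
    print(s, gap)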


In [ ]:
#recompute the headline counts directly from the full dataframe
white_called = sum(data[data.race=='w'].call)
black_called = sum(data[data.race=='b'].call)
black_nocall = sum(data[data.race=='b'].call == 0)
white_nocall = sum(data[data.race=='w'].call == 0)